
    MPI Thread-Level Checking for MPI+OpenMP Applications

    MPI is the most widely used parallel programming model, but the shrinking amount of memory per compute core increasingly pushes MPI to be combined with shared-memory approaches such as OpenMP. In such hybrid codes, the interoperability of the two models is challenging. The MPI 2.0 standard defines so-called thread levels to indicate how MPI will interact with threads, yet even though hybrid programs are becoming common, debugging tools still lack support for checking thread-level compliance. To fill this gap, we propose a static analysis that verifies the thread level required by an application. This work extends PARCOACH, a GCC plugin focused on detecting MPI collective errors in MPI and MPI+OpenMP programs. We validated our analysis on computational benchmarks and applications and measured a low overhead.
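
    To illustrate what thread-level compliance means, the hybrid code below is a hand-written sketch (not PARCOACH output or an example from the paper): it issues MPI calls from inside an OpenMP parallel region, so it must request MPI_THREAD_MULTIPLE and check the level actually provided. Requesting only MPI_THREAD_FUNNELED here would be exactly the kind of mismatch a thread-level analysis should flag.

        /* Hand-written sketch: requesting and checking the MPI thread level
         * before calling MPI from inside an OpenMP parallel region. */
        #include <mpi.h>
        #include <omp.h>
        #include <stdio.h>

        int main(int argc, char **argv)
        {
            int provided, rank;

            /* Every thread issues MPI calls below, so MPI_THREAD_MULTIPLE
             * is the level this program requires. */
            MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
            if (provided < MPI_THREAD_MULTIPLE) {
                fprintf(stderr, "provided thread level %d is too low\n", provided);
                MPI_Abort(MPI_COMM_WORLD, 1);
            }
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            #pragma omp parallel
            {
                int tid = omp_get_thread_num();
                int out = tid, in = -1;
                /* Each thread exchanges a message with its own rank, using the
                 * thread id as tag; concurrent MPI calls from several threads
                 * are only legal at MPI_THREAD_MULTIPLE. */
                MPI_Sendrecv(&out, 1, MPI_INT, rank, tid,
                             &in,  1, MPI_INT, rank, tid,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }

            MPI_Finalize();
            return 0;
        }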

    PARCOACH: Combining static and dynamic validation of MPI collective communications

    Nowadays most scientific applications are parallelized with MPI communications. Collective MPI communications have to be executed in the same order, and the same number of times, by all processes in their communicator; otherwise the program does not conform to the standard and a deadlock or other undefined behavior can occur. As soon as the control flow surrounding these collective operations becomes more complex, in particular when it includes conditionals on process ranks, ensuring the correctness of such code becomes error-prone. In this paper we propose a static analysis that detects when such a situation occurs, combined with a code transformation that prevents deadlocks. We focus on blocking MPI collective operations in SPMD applications, assuming MPI calls are not nested in multithreaded regions. We show on several benchmarks the small impact on performance and the ease of integrating our techniques into the development process.
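
    The sketch below, written for this summary rather than taken from the paper, shows the class of error PARCOACH targets: a blocking collective guarded by a rank-dependent condition, so that only some processes reach it.

        /* Illustrative only: the kind of error PARCOACH detects.  The barrier
         * is guarded by a rank-dependent condition, so the processes take
         * different control-flow paths and not all of them reach the
         * collective: the program deadlocks or has undefined behavior. */
        #include <mpi.h>

        int main(int argc, char **argv)
        {
            int rank;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            if (rank % 2 == 0) {
                /* Only even ranks call the collective: non-conforming. */
                MPI_Barrier(MPI_COMM_WORLD);
            }
            /* A conforming version calls MPI_Barrier unconditionally, or on a
             * sub-communicator built with MPI_Comm_split over the even ranks. */

            MPI_Finalize();
            return 0;
        }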

    Reducing Memory Requirements of Stream Programs by Graph Transformations

    Stream languages explicitly describe fork-join parallelism and pipelines, offering a powerful programming model for many-core Multi-Processor Systems on Chip (MPSoC). In an embedded resource-constrained system, adapting stream programs to fit memory requirements is particularly important. In this paper we present a new approach to reduce the memory footprint required to run stream programs on MPSoC. Through an exploration of equivalent program variants, the method selects parallel code that minimizes memory consumption. For large program instances, a heuristic accelerating the exploration phase is proposed and evaluated. We demonstrate the benefits of our method on a panel of ten significant benchmarks. Using a multi-core modulo scheduling technique, our approach considerably lowers the minimal amount of memory required to run seven of these benchmarks while preserving throughput.

    Design-Space Exploration of Stream Programs through Semantic-Preserving Transformations

    Stream languages explicitly describe fork-join parallelism and pipelines, offering a powerful programming model for many-core Multi-Processor Systems on Chip (MPSoC). In an embedded resource-constrained system, adapting stream programs to fit memory requirements is particularly important. In this paper we present a design-space exploration technique to reduce the minimal memory required when running stream programs on MPSoC; this allows targeting memory-constrained systems and, in some cases, obtaining better performance. Using a set of semantic-preserving transformations, we explore a large number of equivalent program variants and select the variant that minimizes a buffer evaluation metric. To cope efficiently with large program instances we propose and evaluate a heuristic for this method. We demonstrate the benefits of our method on a panel of ten significant benchmarks. As an illustration, we measure the minimal memory required using a multi-core modulo scheduling. Our approach considerably lowers the minimal memory required for seven of the ten benchmarks.
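
    To give the flavor of the approach, the program below is a toy, hand-written sketch with made-up rates, not the paper's algorithm or metric: it enumerates the valid firing orders of a two-actor stream edge and keeps the variant whose peak buffer occupancy, a simple stand-in for a buffer evaluation metric, is smallest.

        /* Toy design-space exploration: enumerate valid schedules of a
         * producer/consumer edge and keep the one with the smallest peak
         * buffer occupancy.  All rates are made-up illustrations. */
        #include <stdio.h>
        #include <string.h>

        #define P  3   /* tokens produced per firing of actor A */
        #define C  2   /* tokens consumed per firing of actor B */
        #define NA 2   /* firings of A per iteration (2*P == 3*C) */
        #define NB 3   /* firings of B per iteration */

        static int  best_peak;
        static char best_sched[NA + NB + 1];

        /* Build all orderings of NA 'A' firings and NB 'B' firings that never
         * let the buffer go negative; record the one with the lowest peak. */
        static void explore(char *sched, int pos, int a, int b, int tokens, int peak)
        {
            if (a == NA && b == NB) {
                sched[pos] = '\0';
                if (peak < best_peak) { best_peak = peak; strcpy(best_sched, sched); }
                return;
            }
            if (a < NA) {
                sched[pos] = 'A';
                int t = tokens + P;
                explore(sched, pos + 1, a + 1, b, t, t > peak ? t : peak);
            }
            if (b < NB && tokens >= C) {
                sched[pos] = 'B';
                explore(sched, pos + 1, a, b + 1, tokens - C, peak);
            }
        }

        int main(void)
        {
            char sched[NA + NB + 1];
            best_peak = NA * P + 1;          /* worse than any valid schedule */
            explore(sched, 0, 0, 0, 0, 0);
            printf("best schedule %s needs a %d-token buffer\n",
                   best_sched, best_peak);
            return 0;
        }

    On this tiny example, the producer-first schedule AABBB needs a 6-token buffer while the interleaved schedule ABABB needs only 4; picking the cheaper equivalent variant is the essence of the exploration.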

    Programmation unifiée multi-accélérateur OpenCL

    The OpenCL standard provides a programming interface based on task parallelism, supported by different types of compute units (GPU, CPU, Cell, etc.). One characteristic of OpenCL is that the placement of tasks on the different compute units must be done manually. For a hybrid machine with, for example, a multicore CPU and one or more accelerators, this constraint makes load balancing across the units very difficult to achieve, particularly for applications whose task granularity and task count vary during execution. It also limits the scalability of an OpenCL application on a hybrid machine. In this paper we propose to overcome this limitation by creating a virtual, parallel compute unit that aggregates the machine's real units. OpenCL's manual placement targets this virtual unit, and responsibility for placement on the real units is left to a runtime system, which performs the data transfers and places the tasks on the real units. We show that this solution greatly simplifies the programming of applications for hybrid architectures, and does so efficiently.
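
    For readers unfamiliar with OpenCL, the short host-side sketch below (hand-written for this summary, not taken from the paper) shows what manual placement looks like in standard OpenCL: the program enumerates the real devices and creates one command queue per device, and deciding which queue receives which task is entirely the programmer's job. This is precisely what the proposed virtual compute unit hides behind a runtime.

        #define CL_TARGET_OPENCL_VERSION 120
        /* Minimal host-side sketch of "manual placement" in plain OpenCL:
         * one command queue per real device, and the programmer decides
         * which queue (hence which device) receives each task. */
        #include <CL/cl.h>
        #include <stdio.h>

        int main(void)
        {
            cl_platform_id platform;
            cl_device_id devices[8];
            cl_uint ndev = 0;
            cl_int err;

            clGetPlatformIDs(1, &platform, NULL);
            clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 8, devices, &ndev);

            cl_context ctx = clCreateContext(NULL, ndev, devices, NULL, NULL, &err);

            cl_command_queue queues[8];
            for (cl_uint i = 0; i < ndev; i++)
                queues[i] = clCreateCommandQueue(ctx, devices[i], 0, &err);

            /* From here on, every clEnqueue* call names one of these queues
             * explicitly: the placement, and the data transfers it implies,
             * are entirely the programmer's responsibility. */
            printf("%u device(s): placement is manual\n", ndev);

            for (cl_uint i = 0; i < ndev; i++)
                clReleaseCommandQueue(queues[i]);
            clReleaseContext(ctx);
            return 0;
        }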

    A Benchmark-based Performance Model for Memory-bound HPC Applications

    The increasing computation capability of servers comes with a dramatic increase in their complexity through many cores, multiple levels of caches, and NUMA architectures. Exploiting this computing power is increasingly hard, and programmers need ways to understand performance behavior. We present an innovative approach for predicting the performance of memory-bound multi-threaded applications. It relies on micro-benchmarks and a compositional model that combines micro-benchmark measurements to model larger codes. Our memory model takes into account cache sizes and cache coherence protocols, which have a large impact on the performance of multi-threaded codes. Applying this model to real-world HPC kernels shows that it can predict their performance with good accuracy, helping to make optimization decisions that increase application performance.
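
    The sketch below only illustrates the general idea of composing micro-benchmark measurements; the bandwidth numbers and the selection rule are placeholders, not the paper's model. It predicts the time of a memory-bound kernel from the bandwidth of the cache level its working set fits in.

        /* Illustrative sketch: predict the time of a memory-bound kernel by
         * composing bandwidths measured beforehand with micro-benchmarks,
         * choosing the bandwidth of the cache level the working set fits in.
         * All numbers below are placeholders. */
        #include <stdio.h>

        struct level { const char *name; double size_bytes; double bw_bytes_per_s; };

        /* Hypothetical micro-benchmark results for one machine. */
        static const struct level levels[] = {
            { "L1",   32e3,  400e9 },
            { "L2",  256e3,  200e9 },
            { "L3",   20e6,  100e9 },
            { "DRAM", 1e18,   40e9 },   /* catch-all */
        };

        static double predict_time(double working_set, double bytes_moved)
        {
            for (unsigned i = 0; i < sizeof levels / sizeof levels[0]; i++)
                if (working_set <= levels[i].size_bytes)
                    return bytes_moved / levels[i].bw_bytes_per_s;
            return 0.0;   /* unreachable: the DRAM entry catches everything */
        }

        int main(void)
        {
            /* A streaming kernel touching 8 MB once: expect L3-resident behavior. */
            double t = predict_time(8e6, 8e6);
            printf("predicted time: %.3f ms\n", t * 1e3);
            return 0;
        }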

    Automated Code Generation for Lattice Quantum Chromodynamics and beyond

    We present our ongoing work on a Domain Specific Language that aims to simplify Monte-Carlo simulations and measurements in the domain of Lattice Quantum Chromodynamics. The tool chain, called Qiral, is used to produce high-performance OpenMP C code from LaTeX sources. We discuss conceptual issues as well as details of implementation and optimization, and compare the performance of the generated code with that of well-established simulation software.
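
    The loop below is a hand-written illustration of the kind of OpenMP C such a tool chain emits; it is not actual Qiral output. A one-dimensional nearest-neighbour update with periodic boundaries stands in for the four-dimensional, matrix-valued operators of lattice QCD.

        /* Hand-written stand-in for generated OpenMP C: a parallel sweep over
         * lattice sites applying a simple nearest-neighbour operator. */
        #include <stdio.h>

        #define L 1024   /* lattice sites */

        static double phi[L], out[L];

        int main(void)
        {
            for (int i = 0; i < L; i++) phi[i] = (double)i / L;

            #pragma omp parallel for
            for (int i = 0; i < L; i++) {
                int up = (i + 1) % L, down = (i + L - 1) % L;
                out[i] = phi[up] + phi[down] - 2.0 * phi[i];
            }

            printf("out[0] = %g\n", out[0]);
            return 0;
        }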

    Performance modeling for power consumption reduction on SCC

    As power is becoming one of the biggest challenges in high-performance computing, we propose a performance model for the Single-chip Cloud Computer (SCC) that predicts both the power consumption and the runtime of regular codes. The model takes into account the frequency at which the cores of the SCC chip operate, so we can predict the execution time and power needed to run the code at each available frequency. This makes it possible to choose the best frequency for several metrics, such as power efficiency or minimal power consumption, based on the needs of the application. Our model only needs a few code-dependent parameters, which can be found through static code analysis. We validated the model by showing that it can predict performance and find the optimal frequency divisor for energy efficiency on several dense linear algebra codes.
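
    The program below sketches only the selection step, with placeholder analytic forms and constants rather than the paper's calibrated model: given predicted runtime and power for each frequency divisor, it picks the divisor that minimizes energy.

        /* Illustrative frequency selection: predict time and power for each
         * divisor and keep the one with the lowest energy.  The model forms
         * and constants are placeholders, not the paper's. */
        #include <stdio.h>

        #define FMAX_MHZ 800.0

        int main(void)
        {
            const int divisors[] = { 1, 2, 3, 4, 8, 16 };
            double t_cpu  = 2.0;   /* s of compute-bound work at FMAX (placeholder) */
            double t_mem  = 1.0;   /* s of memory-bound work, frequency-insensitive  */
            double p_stat = 25.0;  /* W of static power (placeholder)                */
            double p_dyn  = 75.0;  /* W of dynamic power at FMAX (placeholder)       */

            int best = -1;
            double best_energy = 1e30;
            for (unsigned i = 0; i < sizeof divisors / sizeof divisors[0]; i++) {
                double f      = FMAX_MHZ / divisors[i];
                double time   = t_mem + t_cpu * (FMAX_MHZ / f);
                double ratio  = f / FMAX_MHZ;
                double power  = p_stat + p_dyn * ratio * ratio * ratio;
                double energy = power * time;
                printf("div %2d: %5.2f s, %6.2f W, %7.1f J\n",
                       divisors[i], time, power, energy);
                if (energy < best_energy) { best_energy = energy; best = divisors[i]; }
            }
            printf("best divisor for energy: %d\n", best);
            return 0;
        }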

    On the Equivalence of Two Systems of Affine Recurrence Equations

    This paper deals with the problem of deciding whether two Systems of Affine Recurrence Equations (SAREs) are equivalent. A solution to this problem would be a first step toward algorithm recognition, an important tool in program analysis, optimization, and parallelization. We first prove that the problem is undecidable in the general case; the proof reduces any instance of Hilbert's tenth problem (the solution of Diophantine equations) to the equivalence of two SAREs. We then show that there nevertheless exists a semi-algorithm, whose key ingredient is the computation of transitive closures of affine relations. This is again an undecidable problem that has been extensively studied, and many partial solutions are known. Finally, we report on a pilot implementation of the algorithm, describe its limitations, and point to unsolved problems.
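
    As a small, hand-made illustration (not taken from the paper), the two SAREs below compute the same output, the sum a(0) + ... + a(n-1), with different recurrence structures; establishing this kind of equivalence is what such a semi-algorithm is meant to do.

        System 1:  s(0) = a(0)
                   s(i) = s(i-1) + a(i)          for 1 <= i < n
                   output: s(n-1)

        System 2:  t(i, 0) = a(0)                for 0 <= i < n
                   t(i, j) = t(i, j-1) + a(j)    for 1 <= j <= i < n
                   output: t(n-1, n-1)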

    Automatic OpenCL Task Adaptation for Heterogeneous Architectures

    OpenCL defines a common parallel programming language for all devices, but writing tasks adapted to each device and managing communication and load balancing are left to the programmer. In this work, we propose a novel automatic compiler and runtime technique to execute a single OpenCL kernel on heterogeneous multi-device architectures. The technique is completely transparent to the user and requires neither off-line training nor a performance model. It handles the communication and load-balancing issues that result from hardware heterogeneity, from load imbalance within the kernel itself, and from load variations between repeated executions of the kernel in an iterative computation. We present results on benchmarks and on an N-body application over two platforms: a 12-core CPU with two different GPUs, and a 16-core CPU with three homogeneous GPUs.
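
    The host code below is a hand-written sketch of the basic mechanism only, not the proposed technique: the kernel's index space is split evenly and statically across all devices of a platform, each device working on its own chunk buffer, whereas the paper's runtime performs this partitioning automatically and adapts it to the observed load.

        #define CL_TARGET_OPENCL_VERSION 120
        /* Static, even split of one kernel across all devices of a platform.
         * Each device gets its own chunk buffer and queue; error checking is
         * omitted for brevity. */
        #include <CL/cl.h>
        #include <stdio.h>
        #include <stdlib.h>

        static const char *src =
            "__kernel void inc(__global float *x) { x[get_global_id(0)] += 1.0f; }";

        int main(void)
        {
            enum { N = 1 << 20, MAXDEV = 8 };
            cl_platform_id plat;
            cl_device_id dev[MAXDEV];
            cl_uint ndev;
            cl_int err;

            clGetPlatformIDs(1, &plat, NULL);
            clGetDeviceIDs(plat, CL_DEVICE_TYPE_ALL, MAXDEV, dev, &ndev);

            cl_context ctx = clCreateContext(NULL, ndev, dev, NULL, NULL, &err);
            cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
            clBuildProgram(prog, ndev, dev, "", NULL, NULL);
            cl_kernel krn = clCreateKernel(prog, "inc", &err);

            float *host = calloc(N, sizeof *host);
            cl_command_queue q[MAXDEV];
            cl_mem chunk[MAXDEV];

            for (cl_uint d = 0; d < ndev; d++) {
                size_t lo = (size_t)d * N / ndev;
                size_t hi = (size_t)(d + 1) * N / ndev;
                size_t sz = hi - lo;

                q[d] = clCreateCommandQueue(ctx, dev[d], 0, &err);
                /* Each device works on a private copy of its slice. */
                chunk[d] = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                          sz * sizeof(float), host + lo, &err);
                clSetKernelArg(krn, 0, sizeof chunk[d], &chunk[d]);
                clEnqueueNDRangeKernel(q[d], krn, 1, NULL, &sz, NULL, 0, NULL, NULL);
                /* Non-blocking read-back of the slice; waited on below. */
                clEnqueueReadBuffer(q[d], chunk[d], CL_FALSE, 0, sz * sizeof(float),
                                    host + lo, 0, NULL, NULL);
            }
            for (cl_uint d = 0; d < ndev; d++) clFinish(q[d]);

            printf("host[0] = %g, host[N-1] = %g\n", host[0], host[N - 1]);
            free(host);
            return 0;
        }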